NVIDIA Resource Renaming

Volcano v1.9.0 introduces capacity scheduling capabilities, allowing you to configure different quotas for various GPU types (essential in production environments). For example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field
    cpu: 2
    memory: 8Gi
    nvidia.com/t4: 40
    nvidia.com/a100: 20

However, the default NVIDIA Device Plugin reports all GPUs under the single resource name nvidia.com/gpu; it cannot report different GPU models separately as the example requires.

This guide walks you through configuring the NVIDIA Device Plugin to report different GPU models and integrating it with Volcano's capacity scheduling.

1. Install a Customized Device Plugin

This section covers how to install a customized Device Plugin using Helm. Instructions for installing Helm can be found here; for the prerequisites of the Device Plugin itself, refer to the documentation.

If you are using the GPU Operator to manage all GPU-related components, you have two options: disable the GPU Operator's management of the Device Plugin and follow the steps in this section, or modify the GPU Operator configuration accordingly.
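For the first option, the GPU Operator's Helm chart exposes a toggle for the bundled Device Plugin. A minimal sketch, assuming the operator was installed from the nvidia Helm repository and that your chart version supports the devicePlugin.enabled value:

helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.enabled=false  # let the customized Device Plugin below take over

With the operator's plugin disabled, continue with the steps in this section.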

1.1 Install a Custom Device Plugin

Begin by setting up the plugin's helm repository and updating it as follows:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

Then verify that version v0.16.1, on which the modified release of the plugin is based, is available:

$ helm search repo nvdp --version 0.16.1
NAME                         CHART VERSION   APP VERSION   DESCRIPTION
nvdp/nvidia-device-plugin    0.16.1          0.16.1        A Helm chart for ...

Next, prepare a ConfigMap file, assuming it's named config.yaml. A typical configuration is as follows:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"

Once the repository is updated and the ConfigMap is prepared, install the chart to deploy the volcano-device-plugin:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.16.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set image.repository=volcanosh/volcano-device-plugin \
  --set image.tag=v1.1.0 \
  --set config.default=default-config \
  --set-file config.map.default-config=config.yaml

Here is a brief explanation of some of the configuration parameters:

| Parameter | Usage |
| --------- | ----- |
| --set gfd.enabled=true | Enables the GFD feature. If you already have NFD deployed on your cluster and do not wish for it to be pulled in by this installation, you can disable it with nfd.enabled=false. |
| --set image.repository / --set image.tag | Replaces the official image with the modified version. |
| --set config.default / --set-file config.map | Sets the ConfigMap for the Device Plugin to configure resource renaming rules. |

If you wish to set multiple ConfigMaps to allow different policies for different nodes or configure Time Slicing, refer to the documentation for more information.

The Volcano community provides an image based on NVIDIA Device Plugin 0.16.1. If this version does not meet your needs, you can refer to the discussion and create your custom image that supports Resource Naming.
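Before moving on to the renaming rules, it is worth confirming that the plugin (and GFD) pods are running; the namespace below is the one used in the install command above:

kubectl get pods -n nvidia-device-plugin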

1.2 Set Resource Renaming Rules

Once the Device Plugin is deployed, reporting resources such as nvidia.com/a100 requires updating the ConfigMap with renaming rules, for example:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
# new
resources:
  gpus:
    - pattern: "A100-*-40GB"
      name: a100-40gb
    - pattern: "*A100*"
      name: a100

The device names that the pattern field matches against can be listed on your node by running:

$ nvidia-smi --query-gpu=name --format=csv,noheader
A100-SXM4-40GB
...

Patterns can include wildcards (i.e. '*') to match against multiple devices with similar names. Additionally, the order of the entries under resources.gpus and resources.mig matters: entries earlier in the list are matched before entries later in the list. MIG devices can also be renamed. For more information, you can read the documentation.

After modifying the ConfigMap, you should observe that the node is now reporting nvidia.com/a100 resources, indicating that the configuration was successful.
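A quick way to check is to grep the node status for nvidia.com resources (the node name is a placeholder; the reported names depend on your renaming rules):

kubectl describe node <node-name> | grep nvidia.com/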

1.3 Clean Up Stale Resources

If you previously installed the Device Plugin and it reported nvidia.com/gpu resources, you might notice that nvidia.com/gpu is still retained after reconfiguring the Device Plugin. For a detailed discussion of this issue, refer to here. The node status may look like this:

allocatable:
  nvidia.com/gpu: 0
  nvidia.com/a100: 8

To clean up stale resources, you can start kubectl proxy in one terminal:

$ kubectl proxy
Starting to serve on 127.0.0.1:8001

And in another terminal, run the cleanup script (note that the / in the resource name must be escaped as ~1 in the JSON patch path):

#!/bin/bash

# Check if at least one node name is provided
if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <node-name> [<node-name>...]"
  exit 1
fi

# Prepare the JSON patch data
PATCH_DATA=$(cat <<EOF
[
  {"op": "remove", "path": "/status/capacity/nvidia.com~1gpu"}
]
EOF
)

# Iterate over each node name provided as an argument
for NODE_NAME in "$@"
do
  # Execute the PATCH request
  curl --header "Content-Type: application/json-patch+json" \
    --request PATCH \
    --data "$PATCH_DATA" \
    http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status

  echo "Patch request sent for node $NODE_NAME"
done

Pass the node names and run the cleanup:

chmod +x ./patch_node_gpu.sh
./patch_node_gpu.sh node1 node2
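Afterwards, nvidia.com/gpu should no longer appear in the node status. A quick check, reusing one of the node names from above:

kubectl get node node1 -o jsonpath='{.status.allocatable}'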

2. Configure DCGM Exporter for Pod-Level Monitoring

After changing the GPU resource name, the DCGM Exporter is unable to obtain pod-level GPU usage metrics.

The reason is that, by default, the DCGM Exporter only matches the exact resource name nvidia.com/gpu or names with the prefix nvidia.com/mig-. For more details, refer to here.

Starting from version 3.3.7-3.5.0, you can configure the DCGM Exporter with a list of additional resource names, supplied either through the nvidia-resource-names command-line parameter or the NVIDIA_RESOURCE_NAMES environment variable, for example:

containers:
  - name: nvidia-dcgm-exporter
    image: 'nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.5.0-ubuntu22.04'
    env:
      ...
      - name: NVIDIA_RESOURCE_NAMES
        value: >-
          nvidia.com/a100,nvidia.com/h100
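To verify that pod-level attribution works again, you can port-forward to the exporter and check that the GPU metrics carry pod labels. This sketch assumes the exporter runs as a DaemonSet named nvidia-dcgm-exporter and listens on its default port 9400:

kubectl -n <exporter-namespace> port-forward ds/nvidia-dcgm-exporter 9400:9400 &
# Once matching succeeds, GPU metrics include pod/namespace/container labels
curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL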

3. Configure Volcano to Use the Capacity Scheduling Plugin

After completing the above configuration, you can edit the volcano-scheduler-configmap to enable the capacity plugin:

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # add this plugin and remove the proportion plugin
      - name: nodeorder
      - name: binpack
You can customize your configuration according to your needs. For more information, refer to the documentation.
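With the renamed resources reported and the capacity plugin enabled, a job submitted to queue1 can request a specific GPU model directly. A minimal sketch (the job name, image, and request size are illustrative, not part of the original setup):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: a100-job                 # hypothetical job name
spec:
  schedulerName: volcano
  queue: queue1                  # the queue with the deserved quota shown earlier
  minAvailable: 1
  tasks:
    - replicas: 1
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/a100: 1  # request the renamed resource

The scheduler then counts this request against the queue's deserved nvidia.com/a100 quota rather than the generic nvidia.com/gpu pool.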